Abstract. Many techniques have been introduced in the last few decades
to create ε-free automata representing regular expressions: Glushkov
automata, the so-called follow automata, and Antimirov automata. This
paper presents a simple and unified view of all these ε-free automata
both in the case of unweighted and weighted regular expressions. It
describes simple and general algorithms with running time complexities
at least as good as that of the best previously known techniques, and
provides concise proofs. The construction methods are all based on two
standard automata algorithms: epsilon-removal and minimization. This
contrasts with the multitude of complicated and special-purpose
techniques and proofs put forward by others to construct these automata.

Our analysis provides a better understanding of ε-free automata
representing regular expressions: they are all the results of the
application of some combinations of epsilon-removal and minimization to
the classical Thompson automata. This makes it straightforward to
generalize these algorithms to the weighted case, which also results in
much simpler algorithms than existing ones. For weighted regular
expressions over a closed semiring, we extend the notion of follow
automata to the weighted case.

We  also  present  the  first  algorithm  to  compute  the  Antimirov  automata 
in  the  weighted  case. 

1  Introduction 

The construction of finite automata representing regular expressions has been
widely studied due to its multiple applications to pattern-matching and many
other areas of text processing [1,22]. The most classical construction,
Thompson's construction [14,25], creates a finite automaton with a number of
states and transitions linear in the length m of the regular expression.
Figure 1(a) shows an example. The time complexity of the algorithm is also
linear, $O(m)$. But Thompson's automaton contains transitions labeled with the
empty string ε, which create a delay in pattern matching.

Many alternative techniques have been introduced in the last few decades to
create ε-free automata representing regular expressions, in particular, Glushkov
automata [11], follow automata [13], and Antimirov automata [2].

The Glushkov automaton, or position automaton, was independently introduced
by [11] and [17]. Figure 1(b) shows an example for a particular regular
expression. The automaton has exactly $n + 1$ states but up to $n^2$
transitions, where n is the number of occurrences of alphabet symbols appearing
in the expression. For a reasonable expression, $m = O(n)$, so the Glushkov
automaton can be quadratically larger than the Thompson automaton. When using
bit-parallelism for regular expression search, due to its smaller number of
states, the Glushkov automaton can be represented with half the number of
machine words required by the Thompson automaton [21,22].

Several techniques have been suggested for constructing the Glushkov
automaton. In [3], the construction is based on the recursive definition of the
follow function and has a complexity of $O(n^3)$. The algorithm described by [4]
has complexity $O(m + n^2)$ and is based on an optimization of the recursive
definition of the follow function. It requires the expression to be first
rewritten in star-normal form, which can be done non-trivially in $O(m)$.
Several other quadratic algorithms have been given: that of [9], which is based
on an optimization of the follow recursion, and that of [23], based on the ZPC
structure, which consists of two mutually linked copies of the syntactic tree
of the expression.

The Antimirov or partial derivatives automaton was introduced by [2].
Figure 1(d) shows an example. It is in general smaller than the Glushkov
automaton, with up to $n + 1$ states and up to $n^2$ transitions. It was in fact
proven by [8] (see [13] for a simpler proof) to be the quotient of the Glushkov
automaton for some equivalence relation. The complexity of the original
construction algorithm by [2] is $O(m^5)$. [8] presented an algorithm whose
complexity is $O(m^2)$.

Finally, the follow automaton was introduced by [13]. It is the quotient of the
Glushkov automaton by the follow equivalence: two states are equivalent if they
have the same follow and the same finality. Figure 1(c) presents an example. The
author gave an $O(m + n^2)$ algorithm where some ε-transitions are removed from
the automaton at each step of the Thompson construction as well as at the end.
An $O(m + n^2)$ algorithm using the ZPC structure was given in [7], which
requires the regular expression to be rewritten in star-normal form.

Some of these results have been extended to weighted regular expressions
over arbitrary semirings. The generalization of the Thompson construction
trivially follows from [24]. The Glushkov automaton can be naturally extended to
the weighted case [5], and an $O(m^2)$ construction algorithm based on the
generalization of the ZPC construct was given by [6]. The Antimirov automaton
was generalized to the weighted case by [16], but no explicit construction
algorithm or complexity analysis was given by the authors.

This paper presents a simple and unified view of all these ε-free automata
(Glushkov, follow, and Antimirov) both in the case of unweighted and weighted
regular expressions. It describes simple and general algorithms with running
time complexities at least as good as that of the best previously known
techniques, and provides concise proofs. The construction methods are all based
on two standard automata algorithms, epsilon-removal [footnote 1] and
minimization, as summarized by the following table:


Automaton    Algorithm                                                             Complexity
Glushkov     $\mathrm{rmeps}(T)$                                                   $O(mn)$
Follow       $\overline{\min(\mathrm{rmeps}(\overline{T}))}$                       $O(mn)$
Antimirov    $\widetilde{\mathrm{rmeps}}(\min(\mathrm{rmeps}(\widetilde{T})))$     $O(m \log m + mn)$

Here T is the Thompson automaton and $\overline{T}$ is the automaton derived
from T by marking alphabet symbols with their position in the expression. When
the symbols are marked, the same overline notation denotes the operation that
removes the marking. $\widetilde{T}$ is obtained by marking some ε-transitions
in T, making it deterministic (the marked ε-transitions are removed by the
$\widetilde{\mathrm{rmeps}}$ operation).

This contrasts with the multitude of complicated and special-purpose
techniques and proofs put forward by others to construct these automata. There
is no need for fine-tuning some recursions, no requirement that the regular
expression be in star-normal form, and no need to maintain multiple copies of
the syntactic tree.

Our analysis provides a better understanding of ε-free automata representing
regular expressions: they are all the results of the application of some
combinations of epsilon-removal and minimization to the classical Thompson
automata. This makes it straightforward to generalize these algorithms to the
weighted case by using the generalization of ε-removal and minimization
[18,19]. This also results in much simpler algorithms than existing ones.

In particular, this leads to a straightforward algorithm for the construction
of the Glushkov automaton of a weighted regular expression, and, in the case
of closed semirings, allows us to generalize the notion of follow automaton to
the weighted case. We also give the first explicit construction algorithm of the
Antimirov automaton of a weighted expression. When the semiring is k-closed (or
only ε-k-closed for the regular expression in the Glushkov case), the
complexities of the construction algorithms are the same as in the unweighted
case.

2  Preliminaries 

1 ε-removal is less well known as an algorithm because it has often been, and
continues to be, presented only as part of determinization in many textbooks.

Fig. 1. (a) The Thompson automaton, (b) Glushkov automaton, (c) Follow automaton,
and (d) Antimirov automaton representing the regular expression
$\alpha = (a + b)(a^* + ba^* + b^*)^*$. This regular expression is the running
example from [13].

Semirings A semiring $(\mathbb{K}, \oplus, \otimes, \overline{0}, \overline{1})$
is a ring that may lack negation. $\mathbb{K}$ is closed if
$a^* = \bigoplus_{n \geq 0} a^n$ is defined for all $a \in \mathbb{K}$, and
k-closed if there exists $k \geq 0$ such that $a^* = \bigoplus_{n=0}^{k} a^n$
for all $a \in \mathbb{K}$. Examples of semirings are the boolean semiring
$(\mathbb{B}, \vee, \wedge, 0, 1)$, the tropical semiring
$(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$, and the real semiring
$(\mathbb{R}_+, +, \times, 0, 1)$.
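The three example semirings can be written down concretely. The sketch below is only an illustration: the names `Semiring`, `partial_star` and `looks_k_closed` are hypothetical and not taken from the paper, and the k-closedness check merely tests $\bigoplus_{n=0}^{k+1} a^n = \bigoplus_{n=0}^{k} a^n$ on a few sample values, which is evidence rather than a proof.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    """Hypothetical container for (plus, times, zero, one)."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)
REAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

def partial_star(sr, a, k):
    """Return one (+) a (+) a^2 (+) ... (+) a^k in the semiring sr."""
    total, power = sr.one, sr.one
    for _ in range(k):
        power = sr.times(power, a)
        total = sr.plus(total, power)
    return total

def looks_k_closed(sr, samples, k):
    """Check the k-closedness identity on the given sample elements only."""
    return all(partial_star(sr, a, k + 1) == partial_star(sr, a, k)
               for a in samples)
```

On these samples the boolean and tropical semirings behave as 0-closed, while the real semiring fails the test for a = 0.5: it is closed only where the geometric series converges.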


Weighted automata A weighted automaton A over a semiring $\mathbb{K}$ is a
7-tuple $(\Sigma, Q, E, I, F, \lambda, \rho)$ where: $\Sigma$ is a finite
alphabet; $Q$ is a finite set of states; $I \subseteq Q$ the set of initial
states; $F \subseteq Q$ the set of final states;
$E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times \mathbb{K} \times Q$ a
finite set of transitions; $\lambda : I \to \mathbb{K}$ the initial weight
function; and $\rho : F \to \mathbb{K}$ the final weight function.

Given a transition $e \in E$, we denote by $i[e]$ its input label, $p[e]$ its
origin or previous state, $n[e]$ its destination state or next state, and
$w[e]$ its weight. Given a state $q \in Q$, we denote by $E[q]$ the set of
transitions leaving q.

A path $\pi = e_1 \cdots e_k$ is an element of $E^*$ with consecutive
transitions: $n[e_{i-1}] = p[e_i]$, $i = 2, \ldots, k$. We extend n and p to
paths by setting $n[\pi] = n[e_k]$ and $p[\pi] = p[e_1]$. A cycle $\pi$ is a
path whose origin and destination states coincide: $n[\pi] = p[\pi]$. We denote
by $P(q, q')$ the set of paths from q to q' and by $P(q, x, q')$ the set of
paths from q to q' with input label $x \in \Sigma^*$. These definitions can be
extended to subsets $R, R' \subseteq Q$ by
$P(R, x, R') = \bigcup_{q \in R,\, q' \in R'} P(q, x, q')$. The labeling
function i and the weight function w can also be extended to paths:
$i[\pi] = i[e_1] \cdots i[e_k]$ and
$w[\pi] = w[e_1] \otimes \cdots \otimes w[e_k]$. The weight associated by A to
each input string $x \in \Sigma^*$ is

$[\![A]\!](x) = \bigoplus_{\pi \in P(I, x, F)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi])$.

$[\![A]\!](x)$ is defined to be $\overline{0}$ when $P(I, x, F) = \emptyset$.
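For an ε-free automaton, the sum over paths in the definition of the weight of a string can be evaluated by a forward pass over the input, using distributivity of ⊗ over ⊕. The following is a minimal sketch under that assumption; the function name `string_weight` and the tiny example automaton are made up for illustration.

```python
def string_weight(transitions, initial, final, x, plus, times, zero):
    """Weight assigned to string x by an epsilon-free weighted automaton.

    transitions: list of (source, label, weight, destination)
    initial, final: dicts mapping states to initial / final weights
    """
    # forward[q] = (+)-sum of lambda(p[pi]) (x) w[pi] over paths pi that
    # start in an initial state, read the prefix seen so far, and end in q.
    forward = dict(initial)
    for symbol in x:
        nxt = {}
        for (p, a, w, q) in transitions:
            if a == symbol and p in forward:
                contrib = times(forward[p], w)
                nxt[q] = plus(nxt[q], contrib) if q in nxt else contrib
        forward = nxt
    total = zero  # stays zero when no accepting path exists
    for q, v in forward.items():
        if q in final:
            total = plus(total, times(v, final[q]))
    return total

# Illustrative automaton (not from the paper).
EDGES = [(0, "a", 0.5, 0), (0, "a", 0.25, 1), (1, "b", 2.0, 1)]
INITIAL, FINAL = {0: 1.0}, {1: 1.0}

real_weight = string_weight(EDGES, INITIAL, FINAL, "aab",
                            lambda u, v: u + v, lambda u, v: u * v, 0.0)
tropical_weight = string_weight(EDGES, {0: 0.0}, {1: 0.0}, "aab",
                                min, lambda u, v: u + v, float("inf"))
```

The same transition list is interpreted over the real semiring and over the tropical semiring simply by swapping the two operations and the neutral elements.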

General algorithms Let A be a weighted automaton over $\mathbb{K}$. The
shortest distance from p to q is defined as
$d[p, q] = \bigoplus_{\pi \in P(p, q)} w[\pi]$. It can be computed using the
generic single-source shortest-distance algorithm of [20] if $\mathbb{K}$ is
k-closed for A, or using a generalization of Floyd-Warshall [15, 20] if
$\mathbb{K}$ is closed for A.

The general ε-removal algorithm of [19] consists of first computing the
ε-closure of each state p in A,

$\mathrm{closure}(p) = \{(q, w) \mid w = d_\epsilon[p, q] = \bigoplus_{\pi \in P(p, q),\, i[\pi] = \epsilon} w[\pi] \neq \overline{0}\}$,   (1)

and then, for each state p, of deleting all the outgoing ε-transitions of p, and
adding out of p all the non-ε transitions leaving each state
$q \in \mathrm{closure}(p)$ with their weight pre-⊗-multiplied by
$d_\epsilon[p, q]$. If $\mathbb{K}$ is k-closed for the ε-cycles of A
[footnote 2], then the generic single-source shortest-distance algorithm [20]
can be used to compute the ε-closures.

Weight pushing [18] is a normalization algorithm that redistributes the weights
along the paths of A such that
$\bigoplus_{e \in E[q]} w[e] \oplus \rho(q) = \overline{1}$ for every state
$q \in Q$. We denote by push(A) the resulting automaton. The algorithm requires
that $\mathbb{K}$ be zero-sum free, weakly left divisible, and closed or
k-closed for A, since it depends on the computation of $d[q, F]$ for all
$q \in Q$. It was proved in [18] that, if A is deterministic (i.e. if no two
transitions leaving any state share the same label and if it has a unique
initial state), then the algorithm consisting of weight pushing followed by
unweighted minimization (considering the pairs (label, weight) as a single
symbol) leads to a minimal automaton equivalent to A, denoted by min(A).

See Figure 2 for an illustration of these algorithms. More detailed descriptions
are given in the appendix.
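To make the normalization condition of weight pushing concrete, here is a small sketch restricted to the real semiring and to acyclic automata, where $d[q, F]$ can be computed in reverse topological order. The reweighting formulas used (each transition weight becomes $d[p]^{-1} \cdot w \cdot d[q]$, each final weight $d[q]^{-1} \cdot \rho(q)$, and the initial weight is multiplied by $d[I, F]$) are the usual ones for weight pushing; the function name and the example automaton are hypothetical.

```python
def push_weights(states_topo, edges, final, initial_state):
    """Weight pushing over the real semiring for an acyclic automaton.

    states_topo: states listed in topological order
    edges: list of (source, label, weight, destination)
    final: dict mapping final states to their final weight
    Returns (new initial weight, pushed edges, pushed final weights).
    """
    d = {}  # d[q] = total weight of all paths from q to a final state
    for q in reversed(states_topo):
        d[q] = final.get(q, 0.0) + sum(w * d[r]
                                       for (p, _, w, r) in edges if p == q)
    pushed = [(p, a, w * d[r] / d[p], r)
              for (p, a, w, r) in edges if d[p] > 0 and d[r] > 0]
    new_final = {q: rho / d[q] for q, rho in final.items() if d[q] > 0}
    return d[initial_state], pushed, new_final

# Illustrative automaton (not the one of Figure 2).
EDGES = [(0, "a", 2.0, 1), (0, "b", 6.0, 2), (1, "c", 3.0, 2)]
init_w, pushed, pushed_final = push_weights([0, 1, 2], EDGES, {2: 2.0}, 0)

def outgoing_mass(q):
    """Sum of pushed outgoing weights plus pushed final weight at state q."""
    return sum(w for (p, _, w, _) in pushed if p == q) + pushed_final.get(q, 0.0)
```

After pushing, the outgoing weights plus the final weight sum to 1 at every state, and the total weight 24 has moved to the initial weight.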

Regular expressions A weighted regular expression over the semiring
$\mathbb{K}$ is recursively defined by: $\emptyset$, $\epsilon$ and
$a \in \Sigma$ are regular expressions, and if $\alpha$ and $\beta$ are regular
expressions then $k\alpha$, $\alpha k$ for $k \in \mathbb{K}$,
$\alpha + \beta$, $\alpha \cdot \beta$ and $\alpha^*$ are also regular
expressions. We denote by $|\alpha|$ the length of $\alpha$, and by
$|\alpha|_\Sigma$ the width of $\alpha$, i.e. the number of occurrences of
alphabet symbols in $\alpha$. Let
$\mathrm{pos}(\alpha) = \{1, 2, \ldots, |\alpha|_\Sigma\}$ be the set of
(alphabet symbol) positions in $\alpha$. An unweighted regular expression can be
seen as a weighted expression over the boolean semiring
$(\mathbb{B}, \vee, \wedge, 0, 1)$. We denote by $A_T(\alpha)$ the Thompson
automaton of $\alpha$ and by $I_{A_T(\alpha)}$ and $F_{A_T(\alpha)}$ its unique
initial and final states. For $i \in \mathrm{pos}(\alpha)$, we define $p_i$ and
$q_i$ as the states such that the alphabet symbol at the i-th position in
$\alpha$ corresponds to the transition from $p_i$ to $q_i$. These states are the
only states having respectively a non-ε outgoing or incoming transition.

2 For A to be well defined, $\mathbb{K}$ needs to be closed for the ε-cycles
of A.


Fig. 2. (a) A weighted automaton $A_1$ over the real semiring
$(\mathbb{R}_+, +, \times, 0, 1)$. (b) The result of the application of
ε-removal to $A_1$. (c) A weighted automaton $A_2$ over the real semiring
$(\mathbb{R}_+, +, \times, 0, 1)$. (d) The result of weight pushing. (e) The
result of minimization. The initial weight in the last two automata is 64.


3  Glushkov  Automaton 


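Let $\alpha$ be a weighted regular expression over the alphabet $\Sigma$ and the
semiring $\mathbb{K}$. We denote by $\overline{\alpha}$ the weighted regular
expression obtained by marking each symbol of $\alpha$ with its position. The
Glushkov or position automaton $A_G(\alpha)$ of $\alpha$ is defined by the
7-tuple $(\Sigma, \mathrm{pos}_0(\alpha), E, 0, \overline{1}, F, \rho)$ where
$\mathrm{pos}_0(\alpha) = \mathrm{pos}(\alpha) \cup \{0\}$,

$E = \{(i, a, w, j) : (j, w) \in \mathrm{follow}(\alpha, i) \text{ and } \mathrm{pos}(\alpha, j) = a\}$,   (2)

and for $i \in \mathrm{pos}_0(\alpha)$, $i \in F$ iff there exists
$w \in \mathbb{K}$ such that $(i, w) \in \mathrm{last}_0(\alpha)$, and then
$\rho(i) = w$.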

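The functions $\mathrm{null}(\alpha) \in \mathbb{K}$,
$\mathrm{first}(\alpha) \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$,
$\mathrm{last}(\alpha) \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$ and
$\mathrm{follow}(\alpha, i) \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$
are recursively defined over the subterms of $\alpha$ as shown in the tables
below. We also define $\mathrm{follow}(\alpha, 0) = \mathrm{first}(\alpha)$ and
$\mathrm{last}_0(\alpha)$ as
$\mathrm{last}(\alpha) \cup \{(0, \mathrm{null}(\alpha))\}$ if
$\mathrm{null}(\alpha) \neq \overline{0}$, and $\mathrm{last}(\alpha)$
otherwise. For $X \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$,
$k \in \mathbb{K}$ and $i \in \mathrm{pos}(\alpha)$,
$k \cdot X = \{(i, k \otimes w) \mid (i, w) \in X\}$ if $k \neq \overline{0}$,
$\overline{0} \cdot X = \emptyset$ ($X \cdot k$ is defined similarly), and
$\langle X, i \rangle = w$ if there exists w such that $(i, w) \in X$, and
$\langle X, i \rangle = \overline{0}$ otherwise. The union of two weighted
subsets X and Y is defined by
$X \cup Y = \{(i, \langle X, i \rangle \oplus \langle Y, i \rangle) \mid \langle X, i \rangle \oplus \langle Y, i \rangle \neq \overline{0}\}$.
For example, $\{(i, w)\} \cup \{(i, w')\} = \{(i, w \oplus w')\}$.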


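Expression            null                                              first                                                                      last
$\emptyset$           $\overline{0}$                                    $\emptyset$                                                                $\emptyset$
$\epsilon$            $\overline{1}$                                    $\emptyset$                                                                $\emptyset$
$a_i$                 $\overline{0}$                                    $\{(i, \overline{1})\}$                                                    $\{(i, \overline{1})\}$
$k\beta$              $k \otimes \mathrm{null}(\beta)$                  $k \cdot \mathrm{first}(\beta)$                                            $\mathrm{last}(\beta)$
$\beta k$             $\mathrm{null}(\beta) \otimes k$                  $\mathrm{first}(\beta)$                                                    $\mathrm{last}(\beta) \cdot k$
$\beta + \gamma$      $\mathrm{null}(\beta) \oplus \mathrm{null}(\gamma)$   $\mathrm{first}(\beta) \cup \mathrm{first}(\gamma)$                    $\mathrm{last}(\beta) \cup \mathrm{last}(\gamma)$
$\beta \cdot \gamma$  $\mathrm{null}(\beta) \otimes \mathrm{null}(\gamma)$  $\mathrm{first}(\beta) \cup \mathrm{null}(\beta) \cdot \mathrm{first}(\gamma)$   $\mathrm{last}(\beta) \cdot \mathrm{null}(\gamma) \cup \mathrm{last}(\gamma)$
$\beta^*$             $\mathrm{null}(\beta)^*$                          $\mathrm{null}(\beta)^* \cdot \mathrm{first}(\beta)$                       $\mathrm{last}(\beta) \cdot \mathrm{null}(\beta)^*$

Expression            follow(·, i)
$\emptyset$           $\emptyset$
$\epsilon$            $\emptyset$
$a_i$                 $\emptyset$
$k\beta$              $\mathrm{follow}(\beta, i)$
$\beta k$             $\mathrm{follow}(\beta, i)$
$\beta + \gamma$      $\mathrm{follow}(\beta, i)$ if $i \in \mathrm{pos}(\beta)$
                      $\mathrm{follow}(\gamma, i)$ if $i \in \mathrm{pos}(\gamma)$
$\beta \cdot \gamma$  $\mathrm{follow}(\beta, i) \cup \langle \mathrm{last}(\beta), i \rangle \cdot \mathrm{first}(\gamma)$ if $i \in \mathrm{pos}(\beta)$
                      $\mathrm{follow}(\gamma, i)$ if $i \in \mathrm{pos}(\gamma)$
$\beta^*$             $\mathrm{follow}(\beta, i) \cup \langle \mathrm{last}(\beta^*), i \rangle \cdot \mathrm{first}(\beta)$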

$\mathrm{null}(\alpha) = \mathrm{null}(\overline{\alpha})$ is the value
associated by $\alpha$ to $\epsilon$. For $\alpha$ to be well defined,
$\mathrm{null}(\beta)^*$ must be defined for every subterm $\beta^*$. There is
in fact a very simple relationship between the first, last and follow functions
and the ε-closures of the states in the Thompson automaton that admit a non-ε
incoming transition.
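In the unweighted (boolean) case the recursive tables reduce to ordinary set computations. The sketch below evaluates null, first, last and follow on the running example $(a + b)(a^* + ba^* + b^*)^*$ with positions 1 to 6. The tuple encoding of expressions and the function names are assumptions made for this illustration only, and weights are omitted entirely.

```python
def glushkov_sets(e, follow):
    """Return (null, first, last) of e and fill follow[i] for its positions.

    Expressions are nested tuples: ("empty",), ("eps",), ("sym", a, i),
    ("+", l, r), (".", l, r), ("*", body). Boolean semiring only.
    """
    kind = e[0]
    if kind == "empty":
        return False, set(), set()
    if kind == "eps":
        return True, set(), set()
    if kind == "sym":
        follow[e[2]] = set()
        return False, {e[2]}, {e[2]}
    if kind in ("+", "."):
        n1, f1, l1 = glushkov_sets(e[1], follow)
        n2, f2, l2 = glushkov_sets(e[2], follow)
        if kind == "+":
            return n1 or n2, f1 | f2, l1 | l2
        for i in l1:                      # last(beta) is followed by first(gamma)
            follow[i] |= f2
        first = (f1 | f2) if n1 else f1   # first(beta) U null(beta).first(gamma)
        last = (l1 | l2) if n2 else l2    # last(beta).null(gamma) U last(gamma)
        return n1 and n2, first, last
    if kind == "*":
        _, f, l = glushkov_sets(e[1], follow)
        for i in l:                       # loop back from last to first
            follow[i] |= f
        return True, f, l
    raise ValueError(kind)

def sym(a, i):
    return ("sym", a, i)

# Running example, marked with positions 1..6.
INNER = ("+", ("+", ("*", sym("a", 3)), (".", sym("b", 4), ("*", sym("a", 5)))),
         ("*", sym("b", 6)))
ALPHA = (".", ("+", sym("a", 1), sym("b", 2)), ("*", INNER))

follow = {}
null, first, last = glushkov_sets(ALPHA, follow)
follow[0] = set(first)  # follow(alpha, 0) = first(alpha)
```

The resulting sets give the transitions of the Glushkov automaton: state i has a transition to each j in `follow[i]`, labeled by the symbol at position j.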

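Lemma 1. Let $\alpha$ be a weighted regular expression. Let $A = A_T(\alpha)$.
Then

(i) $(i, w) \in \mathrm{first}(\alpha)$ iff $(p_i, w) \in \mathrm{closure}(I_A)$;

(ii) $(i, w) \in \mathrm{follow}(\alpha, j)$ iff $(p_i, w) \in \mathrm{closure}(q_j)$; and

(iii) $(i, w) \in \mathrm{last}(\alpha)$ iff $(F_A, w) \in \mathrm{closure}(q_i)$.

Proof. The proof is by induction on the length of the regular expression. If
$\alpha = a$, $\alpha = \epsilon$ or $\alpha = \emptyset$, then the properties
trivially hold. Due to lack of space, we will only treat the case
$\alpha = \beta \cdot \gamma$; other cases can be treated similarly. Let
$A = A_T(\alpha)$, $B = A_T(\beta)$ and $C = A_T(\gamma)$.

If $\alpha = \beta \cdot \gamma$, then
$\mathrm{closure}_A(I_A) = \mathrm{closure}_B(I_B) \cup [\![B]\!](\epsilon) \cdot \mathrm{closure}_C(I_C)$,
thus (i) recursively holds since $[\![B]\!](\epsilon) = \mathrm{null}(\beta)$.
If $j \in \mathrm{pos}(\gamma)$, then
$\mathrm{closure}_A(q_j) = \mathrm{closure}_C(q_j)$. Otherwise
$j \in \mathrm{pos}(\beta)$ and

$\mathrm{closure}_A(q_j) = \mathrm{closure}_B(q_j) \cup \langle \mathrm{closure}_B(q_j), F_B \rangle \cdot \mathrm{closure}_C(I_C)$.   (3)

Thus, (ii) and (iii) recursively hold. □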

The  following  theorem  follows  directly  from  the  lemma  just  presented. 

Theorem 1. Let $\alpha$ be a weighted regular expression. Then:

$A_G(\alpha) = \mathrm{rmeps}(A_T(\alpha))$.   (4)


Let $\alpha$ be a weighted regular expression over $\mathbb{K}$. We will say
that $\mathbb{K}$ is ε-k-closed for $\alpha$ if there exists k such that for
every subterm $\beta^*$ of $\alpha$,
$\mathrm{null}(\beta)^* = \bigoplus_{i=0}^{k} \mathrm{null}(\beta)^i$.


Lemma 2. Let A be the Thompson automaton of a weighted regular expression
over a k-closed semiring. There is a queue discipline for which the complexity of
the single-source shortest-distance algorithm from any state in A is linear.

Proof. We define the subterm depth of a state q in A as the number of subterms
$\beta + \gamma$ and $\beta^*$ it belongs to. We then use a larger subterm-depth
first queue discipline. The queue can be maintained in constant time since (1)
there are at most two states having the same subterm depth in the queue at any
time and (2) if d is the maximal subterm depth of an element in the queue at a
given time, the subterm depth of the state inserted next will be $d - 1$, $d$
or $d + 1$. □

Theorem 2. Let $\alpha$ be a weighted regular expression over a semiring
$\mathbb{K}$ that is ε-k-closed for $\alpha$. The Glushkov automaton of
$\alpha$ can be constructed in time $O(mn)$ by applying ε-removal to its
Thompson automaton.

Proof. If $\mathbb{K}$ is ε-k-closed for $\alpha$, then $\mathbb{K}$ is k-closed
for all the paths considered during the computation of the ε-closures and, by
Lemma 2, each ε-closure can be computed in $O(m)$. Since $n + 1$ closures need
to be computed, the total complexity is in $O(mn + n^2) = O(mn)$. □
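The count behind this bound can be written out as a single chain of equalities. The last step relies on $n \le m$, which holds because every symbol occurrence contributes to the length of the expression; reading the $n^2$ term as the cost of emitting up to $n^2$ transitions is an interpretation, since the proof does not label it.

```latex
\[
  \underbrace{(n+1)}_{\text{closures}} \cdot \underbrace{O(m)}_{\text{Lemma 2}}
  \;+\; \underbrace{O(n^2)}_{\text{transitions}}
  \;=\; O(mn + n^2)
  \;=\; O(mn)
  \qquad \text{since } n \le m .
\]
```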

In the unweighted case, the unpublished manuscript [10] showed that the
Glushkov automaton could be obtained by removing the ε-transitions from the
Thompson automaton. However, the authors used a special-purpose ε-removal
algorithm and not the classical ε-removal algorithm, limiting the scope of their
results.


4  Follow  Automaton 

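The follow automaton of an unweighted regular expression $\alpha$, denoted by
$A_F(\alpha)$, was introduced by [13]. It is the quotient of $A_G(\alpha)$ by
the equivalence relation $\equiv_F$ defined over $\mathrm{pos}_0(\alpha)$ by:

$i \equiv_F j \iff \begin{cases} \{i, j\} \subseteq \mathrm{last}_0(\alpha) \text{ or } \{i, j\} \cap \mathrm{last}_0(\alpha) = \emptyset, \text{ and} \\ \mathrm{follow}(\alpha, i) = \mathrm{follow}(\alpha, j). \end{cases}$   (5)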
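The equivalence groups positions that agree on finality and on their follow sets, which can be computed by bucketing on that pair. The sketch below does this for the running example $(a + b)(a^* + ba^* + b^*)^*$ of [13]; the follow sets and $\mathrm{last}_0$ used as input were derived by hand from the tables of Section 3 and should be treated as illustrative, and `follow_classes` is a hypothetical name.

```python
from collections import defaultdict

def follow_classes(follow, last0):
    """Group positions by (finality, follow set), i.e. by the follow equivalence."""
    groups = defaultdict(list)
    for i in sorted(follow):
        key = (i in last0, frozenset(follow[i]))
        groups[key].append(i)
    return sorted(groups.values())

# Unweighted follow sets of the running example, positions 0..6.
FOLLOW = {
    0: {1, 2},
    1: {3, 4, 6}, 2: {3, 4, 6}, 3: {3, 4, 6}, 6: {3, 4, 6},
    4: {3, 4, 5, 6}, 5: {3, 4, 5, 6},
}
LAST0 = {1, 2, 3, 4, 5, 6}  # null(alpha) is false, so 0 is not final

classes = follow_classes(FOLLOW, LAST0)
```

With these inputs the seven states of the Glushkov automaton collapse to three classes, which are the states of the follow automaton.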

Theorem 3. For any regular expression $\alpha$, the following identity holds:

$A_F(\alpha) = \overline{\min(A_G(\overline{\alpha}))}$.

Note that it is mentioned in [13] that minimization could be used to construct
the follow automata, but the authors claim that the complexity of minimization
would be in $O(n^2 \log n)$, making this approach less efficient. The following
theorem shows that minimization has in fact a better complexity in this case.
Observe that $A_G(\overline{\alpha})$ is deterministic.

Theorem 4. The time complexity of Hopcroft's minimization algorithm when
applied to $A_G(\overline{\alpha})$ is linear in the size of the automaton,
i.e., in $O(n^2)$ where $n = |\alpha|_\Sigma$.


Proof. Due to space constraints, we will give only a sketch of the proof. The
$\log |Q|$ factor in Hopcroft's algorithm corresponds to the number of times the
incoming transitions at a given state q are used to split a subset (tentative
equivalence class). In $A_G(\overline{\alpha})$, transitions sharing the same
label all have the same destination state (the automaton is 1-local), thus each
incoming transition of a state q can only be used once to split a subset. □

This  theorem  actually  holds  for  all  1-local  automata. 

This leads to a simple algorithm for constructing the follow automaton of a
regular expression $\alpha$:

$A_F(\alpha) = \overline{\min(\mathrm{rmeps}(A_T(\overline{\alpha})))}$   (6)

whose complexity $O(mn)$ is identical to that of the more complicated and
special-purpose algorithms of [13, 7]. When the semiring $\mathbb{K}$ is weakly
divisible, zero-sum free, and closed, we can then define the follow automaton of
a weighted regular expression $\alpha$ as
$A_F(\alpha) = \overline{\min(A_G(\overline{\alpha}))}$.

Theorem 5. If $\mathbb{K}$ is k-closed, then $A_F(\alpha)$ can be computed in
$O(mn)$.

Proof. The shortest-distance computation required by weight pushing can be
done in $O(m)$ in the case of $A_T(\overline{\alpha})$ and is preserved by
ε-removal. The weighted automaton $\mathrm{push}(A_G(\overline{\alpha}))$ is
1-local when considered as a finite automaton over pairs (label, weight), thus
Theorem 4 can be applied. □

5  Antimirov  Automaton 

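In the following we will consider pairs $(w, \alpha)$ with $w \in \mathbb{K}$,
and we define $k \cdot (w, \alpha) = (k \otimes w, \alpha)$,
$(w, \alpha) \cdot k = (w, \alpha k)$ and
$(w, \alpha) \cdot \beta = (w, \alpha \cdot \beta)$. These operations can
naturally be extended to multisets [footnote 3] of pairs (weight, expression).

The partial derivative of $\alpha$ with respect to $a \in \Sigma$ is the
multiset of pairs (weight, expression) recursively defined by:

$\partial_a(\epsilon) = \partial_a(\emptyset) = \emptyset$
$\partial_a(b) = \{(\overline{1}, \epsilon)\}$ if $a = b$, $\emptyset$ otherwise
$\partial_a(k\beta) = k \cdot \partial_a(\beta)$
$\partial_a(\beta k) = \partial_a(\beta) \cdot k$
$\partial_a(\beta + \gamma) = \partial_a(\beta) \cup \partial_a(\gamma)$
$\partial_a(\beta \cdot \gamma) = \partial_a(\beta) \cdot \gamma \cup \mathrm{null}(\beta) \cdot \partial_a(\gamma)$
$\partial_a(\beta^*) = \mathrm{null}(\beta)^* \cdot \partial_a(\beta) \cdot \beta^*$

The partial derivative of $\alpha$ with respect to the string
$s \in \Sigma^*$, denoted $\partial_s(\alpha)$, is recursively defined by
$\partial_{sa}(\alpha) = \partial_a(\partial_s(\alpha))$. Let
$D(\alpha) = \{\beta : (w, \beta) \in \partial_s(\alpha) \text{ with } s \in \Sigma^* \text{ and } w \in \mathbb{K}\}$.
Note that for $D(\alpha)$ to be well-defined, we need to define when two
expressions are the same. Here we will only allow the following identities:
$\emptyset \cdot \alpha = \alpha \cdot \emptyset = \emptyset$,
$\emptyset + \alpha = \alpha + \emptyset = \alpha$,
$\overline{0}\alpha = \alpha\overline{0} = \emptyset$,
$\epsilon \cdot \alpha = \alpha \cdot \epsilon = \alpha$,
$\overline{1}\alpha = \alpha\overline{1} = \alpha$,
$k(k'\alpha) = (k \otimes k')\alpha$, $(\alpha k)k' = \alpha(k \otimes k')$ and
$(\alpha + \beta) \cdot \gamma = \alpha \cdot \gamma + \beta \cdot \gamma$
[footnote 4].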

3 By multisets, we mean that
$\{(w, \alpha)\} \cup \{(w', \alpha)\} = \{(w, \alpha), (w', \alpha)\}$.

4 These identities are the trivial identities considered in [16] except for the
last two, which were added to simplify our presentation. Any larger set of
identities can be handled with our method by rewriting $\alpha$ in the
corresponding normal form.
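For the unweighted case, the recursive definition of partial derivatives and the set $D(\alpha)$ of derived terms can be computed directly. The sketch below drops the weights (boolean semiring), applies only the identity $\epsilon \cdot \alpha = \alpha \cdot \epsilon = \alpha$, and uses a hypothetical tuple encoding of expressions; it is an illustration of the definitions, not the construction proposed in this paper.

```python
EPS = ("eps",)

def sym(a):
    return ("sym", a)

def cat(e, f):
    """Concatenation with the trivial identity eps.f = f and e.eps = e."""
    if e == EPS:
        return f
    if f == EPS:
        return e
    return (".", e, f)

def nullable(e):
    kind = e[0]
    if kind in ("eps", "*"):
        return True
    if kind in ("sym", "empty"):
        return False
    if kind == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # kind == "."

def pd(e, a):
    """Set of partial derivatives of e with respect to the symbol a."""
    kind = e[0]
    if kind in ("eps", "empty"):
        return set()
    if kind == "sym":
        return {EPS} if e[1] == a else set()
    if kind == "+":
        return pd(e[1], a) | pd(e[2], a)
    if kind == ".":
        out = {cat(d, e[2]) for d in pd(e[1], a)}
        return (out | pd(e[2], a)) if nullable(e[1]) else out
    return {cat(d, e) for d in pd(e[1], a)}  # kind == "*"

def derived_terms(e, alphabet):
    """All expressions reachable from e by repeated partial derivation."""
    seen, stack = {e}, [e]
    while stack:
        cur = stack.pop()
        for a in alphabet:
            for d in pd(cur, a):
                if d not in seen:
                    seen.add(d)
                    stack.append(d)
    return seen

A_, B_ = sym("a"), sym("b")
G = ("*", ("+", ("+", ("*", A_), (".", B_, ("*", A_))), ("*", B_)))
ALPHA = (".", ("+", A_, B_), G)   # (a + b)(a* + ba* + b*)*
TERMS = derived_terms(ALPHA, "ab")
```

On the running example this yields four derived terms: the expression itself, the starred factor, and the starred factor prefixed by a* or by b*.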


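The Antimirov or partial derivatives automaton of $\alpha$ is defined by the
7-tuple $(\Sigma, D(\alpha), E, \alpha, \overline{1}, F, \mathrm{null})$ where
$E = \{(\beta, a, w, \gamma) \mid w = \bigoplus_{(w', \gamma) \in \partial_a(\beta)} w'\}$
and $F = \{\beta \in D(\alpha) \mid \mathrm{null}(\beta) \neq \overline{0}\}$.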

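Let $\widetilde{\Sigma} = \Sigma \cup \{\epsilon_+^1, \epsilon_+^2, \epsilon_*^1, \epsilon_*^2\}$.
We denote by $\widetilde{A}_T(\alpha)$ the weighted automaton over
$\widetilde{\Sigma}$ obtained by recursively marking some of the ε-transitions
of $A_T(\alpha)$ as follows: if $\alpha = \beta + \gamma$, we label by
$\epsilon_+^1$ (resp. $\epsilon_+^2$) the ε-transition from $I_{A_T(\alpha)}$ to
$I_{A_T(\beta)}$ (resp. $I_{A_T(\gamma)}$); if $\alpha = \beta^*$, we label by
$\epsilon_*^1$ (resp. $\epsilon_*^2$) the two ε-transitions to $I_{A_T(\beta)}$
(resp. $F_{A_T(\alpha)}$). Observe that $\widetilde{A}_T(\alpha)$ can be viewed
as an automaton recognizing the expression $\widetilde{\alpha}$ over
$\widetilde{\Sigma}$ recursively defined by
$\widetilde{\emptyset} = \emptyset$, $\widetilde{\epsilon} = \epsilon$,
$\widetilde{a} = a$, $\widetilde{k\beta} = k\widetilde{\beta}$,
$\widetilde{\beta k} = \widetilde{\beta}k$,
$\widetilde{\beta + \gamma} = \epsilon_+^1\widetilde{\beta} + \epsilon_+^2\widetilde{\gamma}$,
$\widetilde{\beta \cdot \gamma} = \widetilde{\beta} \cdot \widetilde{\gamma}$ and
$\widetilde{\beta^*} = (\epsilon_*^1\widetilde{\beta})^*\epsilon_*^2$.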

For $i \in \mathrm{pos}_0(\alpha)$, we use the same notation $q_i$ (with
$q_0 = I$) for the corresponding states in $A_T(\alpha)$,
$\widetilde{A}_T(\alpha)$ and $\mathrm{rmeps}(\widetilde{A}_T(\alpha))$. For a
state q in $\mathrm{rmeps}(\widetilde{A}_T(\alpha))$, we denote by $L(q)$ the
language recognized from q considering
$\mathrm{rmeps}(\widetilde{A}_T(\alpha))$ as an unweighted automaton over pairs
(symbol, weight). Lemma 3 follows from our marking of the ε-transitions.

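Lemma 3. For $i \in \mathrm{pos}_0(\alpha)$, $L(q_i)$ uniquely defines a regular
expression over $\Sigma$, denoted by $\delta_i$ (or $\delta_i^\alpha$ when there
is an ambiguity).

Lemma 4. For all $i \in \mathrm{pos}_0(\alpha)$ and
$j \in \mathrm{pos}(\alpha)$, we have for $p_j$, $q_i$ in $A_T(\alpha)$ that:

$(p_j, w) \in \mathrm{closure}(q_i)$ iff $(w, \delta_j) \in \partial_a(\delta_i)$.   (7)

Proof. The proof is by induction on the length of the regular expression. If
$\alpha = a$, $\alpha = \epsilon$ or $\alpha = \emptyset$, then the properties
trivially hold. Due to the lack of space, we will only treat the case
$\alpha = \beta \cdot \gamma$; other cases can be treated similarly. Let
$A = A_T(\alpha)$, $B = A_T(\beta)$ and $C = A_T(\gamma)$.

If $q_i$ is in C, then $\delta_i^\alpha = \delta_i^\gamma$ and
$\mathrm{closure}_A(q_i) = \mathrm{closure}_C(q_i)$. Therefore, if
$(p_j, w) \in \mathrm{closure}_A(q_i)$, $p_j$ is in C and then
$\delta_j^\alpha = \delta_j^\gamma$. Hence (7) recursively holds. If $q_i$ is
in B, then $\delta_i^\alpha = \delta_i^\beta \cdot \gamma$ and we have:

$\partial_a(\delta_i^\alpha) = \partial_a(\delta_i^\beta) \cdot \gamma \cup \mathrm{null}(\delta_i^\beta) \cdot \partial_a(\gamma)$   (8)

$\mathrm{closure}_A(q_i) = \mathrm{closure}_B(q_i) \cup \mathrm{null}(\delta_i^\beta) \cdot \mathrm{closure}_C(I_C)$.   (9)

By induction, we have that $(p_j, w) \in \mathrm{closure}_B(q_i)$ iff
$(w, \delta_j^\beta) \in \partial_a(\delta_i^\beta)$, and
$(p_j, w) \in \mathrm{closure}_C(I_C)$ iff
$(w, \delta_j^\gamma) \in \partial_a(\delta_0^\gamma) = \partial_a(\gamma)$.
Hence (7) follows. □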

Observe that $\delta_0 = \alpha$, hence Lemma 4 implies that the $\delta_i$ are
the derived terms of $\alpha$; more precisely, $i \mapsto \delta_i$ is a
surjection from $\mathrm{pos}_0(\alpha)$ onto $D(\alpha)$. This leads us to the
following result, where $\min_{\mathbb{B}}$ is unweighted minimization when each
pair (label, weight) is treated as a regular symbol and
$\widetilde{\mathrm{rmeps}}$ denotes the removal of the marked ε's.


Theorem 6. We have $A_A(\alpha) = \widetilde{\mathrm{rmeps}}(\min_{\mathbb{B}}(\mathrm{rmeps}(\widetilde{A}_T(\alpha))))$.


Proof. Note that $\mathrm{rmeps}(\widetilde{A}_T(\alpha))$ is deterministic.
During minimization, two states $q_i$ and $q_j$ are equivalent iff
$L(q_i) = L(q_j)$, i.e. $\delta_i = \delta_j$ (by Lemma 3). Hence, there is a
bijection between $D(\alpha)$ and the set of states of
$\min_{\mathbb{B}}(\mathrm{rmeps}(\widetilde{A}_T(\alpha)))$ having an incoming
transition with label in $\Sigma$, and hence between $D(\alpha)$ and the set of
states of
$A = \widetilde{\mathrm{rmeps}}(\min_{\mathbb{B}}(\mathrm{rmeps}(\widetilde{A}_T(\alpha))))$.
Lemma 4 ensures that the transitions in A are consistent with the definition of
$A_A(\alpha)$. □

Theorem 7. If $\mathbb{K}$ is ε-k-closed, then $A_A(\alpha)$ can be computed in
$O(m \log m + mn)$.

Theorem 7 follows from the fact that $\mathrm{rmeps}(\widetilde{A}_T(\alpha))$
has $O(m)$ states and transitions. In the unweighted case, this complexity is as
good as that of the more complicated and best known algorithm of [8].

In the weighted case, the use of minimization over (label, weight) pairs is
sub-optimal since states that would be equivalent modulo a ⊗-multiplicative
factor are not merged. When possible, using weighted minimization instead would
lead to a smaller automaton in general. Hence, if $\mathbb{K}$ is closed, we can
define the normalized Antimirov automaton of $\alpha$ as
$\widetilde{\mathrm{rmeps}}(\min_{\mathbb{K}}(\mathrm{rmeps}(\widetilde{A}_T(\alpha))))$.
This automaton would always be smaller than the Antimirov automaton and the
automaton of unitary derived terms of [16] [footnote 5]. If $\mathbb{K}$ is
k-closed, it can be constructed in $O(m \log m + mn)$.

Remark When the condition about k-closedness (resp. ε-k-closedness for
$\alpha$) of $\mathbb{K}$ is relaxed to the closedness of $\mathbb{K}$ (resp.
that $\alpha$ is well-defined), all our construction algorithms can still be
used by replacing the generic single-source shortest-distance algorithm with a
generalization of the Floyd-Warshall algorithm [15, 20], leading to a
complexity in $O(m^3)$. It is not hard however to maintain the quadratic
complexity by modifying the generic single-source shortest-distance algorithm to
take advantage of the special topology of the Thompson automaton.

In the unweighted case, every regular expression can be straightforwardly
rewritten in ε-normal form such that $m = O(n)$. In that case, our $O(mn)$ or
$O(m \log m + mn)$ complexities become $O(m + n^2)$, which is what is often
reported in the literature.

6  Conclusion 

We presented a simple and unified view of ε-free automata representing
unweighted and weighted regular expressions. We showed that standard unweighted
and weighted epsilon-removal and minimization can be used to create the
Glushkov, follow, and Antimirov automata and that the complexities of these
algorithms match those of the best known algorithms. This provides a better
understanding of the ε-free automata representing regular expressions. It also
suggests using other combinations of epsilon-removal and minimization for
creating ε-free automata. For example, in some contexts, it might be beneficial
to use reverse-epsilon-removal rather than epsilon-removal [19]. Note also that
the Glushkov automaton can be constructed on-the-fly since Thompson's
construction and epsilon-removal both admit an on-demand implementation.

5 This automaton can be viewed in our approach as the result of a simpler form of
reweighting than weight pushing, the reweighting used by weighted minimization.
A General algorithms


A.1 Shortest distance

A generic single-source shortest-distance algorithm in weighted automata was
presented in [20]. The algorithm is a generalization of the classical
shortest-distance algorithms. It does not require the semiring to be idempotent.
For a weighted automaton A over $\mathbb{K}$, the condition for the algorithm to
work is that $\mathbb{K}$ must be k-closed for A, i.e. there exists
$k \in \mathbb{N}$ such that for any cycle c in A,
$\bigoplus_{i=0}^{k+1} w[c]^i = \bigoplus_{i=0}^{k} w[c]^i$.


shortest-distance(A, s)
    for each p ∈ Q do
        d[p] ← r[p] ← 0
    d[s] ← r[s] ← 1
    S ← {s}
    while S ≠ ∅ do
        q ← head(S)
        dequeue(S)
        R ← r[q]
        r[q] ← 0
        for each e ∈ E[q] do
            if d[n[e]] ≠ d[n[e]] ⊕ (R ⊗ w[e]) then
                d[n[e]] ← d[n[e]] ⊕ (R ⊗ w[e])
                r[n[e]] ← r[n[e]] ⊕ (R ⊗ w[e])
                if n[e] ∉ S then
                    enqueue(S, n[e])
    d[s] ← 1

Fig.  3.  Pseudocode  of  the  generic  shortest-distance  algorithm. 
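The pseudocode of the generic shortest-distance algorithm can be sketched in Python as follows. This is only an illustrative sketch: the function name, the dictionary-based representation of the automaton, and the semiring parameters plus, times, zero, and one are assumptions made for the example, and a FIFO queue is used as one possible queue discipline.

```python
from collections import deque


def shortest_distance(arcs, source, plus, times, zero, one):
    """Generic single-source shortest distance with a FIFO queue.

    arcs maps a state to a list of (next_state, weight) pairs.
    plus, times, zero, one describe the semiring (hypothetical parameter names).
    """
    states = set(arcs) | {n for out in arcs.values() for n, _ in out} | {source}
    d = {q: zero for q in states}  # current distance estimate
    r = {q: zero for q in states}  # weight added to d[q] since q was last extracted
    d[source] = r[source] = one
    queue, queued = deque([source]), {source}
    while queue:
        q = queue.popleft()
        queued.discard(q)
        R, r[q] = r[q], zero
        for nxt, w in arcs.get(q, []):
            rw = times(R, w)
            updated = plus(d[nxt], rw)
            if updated != d[nxt]:  # relax only if the estimate changes
                d[nxt] = updated
                r[nxt] = plus(r[nxt], rw)
                if nxt not in queued:
                    queue.append(nxt)
                    queued.add(nxt)
    d[source] = one
    return d
```

With the tropical semiring (plus = min, times = +, zero = ∞, one = 0) this behaves like a classical shortest-path computation; with (+, ×, 0, 1) on an acyclic automaton whose weights are all 1 it counts paths, which illustrates that idempotence is not needed.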


The algorithm is also generic in the sense that it works with any queue discipline. The pseudocode of the algorithm is given in Figure 3. The complexity of the algorithm depends on the queue discipline chosen for S; more precisely, it is in

\[ O\Big( |Q| + (T_\oplus + T_\otimes + C(A))\, |E| \max_{q \in Q} N(q) + (C(I) + C(X)) \sum_{q \in Q} N(q) \Big) \tag{10} \]

where  N(q)  denotes  the  number  of  times  state  q  is  extracted  from  the  queue  S, 
C(X)  the  cost  of  extracting  a  state  from  S,  C(I)  the  cost  of  inserting  a  state  in 
S,  and  C(A)  the  cost  of  an  assignment. 

In the case of an acyclic automaton, using the topological order queue discipline, the complexity of the algorithm is linear, i.e., O(|Q| + |E|). In the case of the tropical semiring, using Fibonacci heaps, the complexity of the algorithm is O(|E| + |Q| log |Q|).
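As a small worked illustration, consider the tropical semiring with non-negative weights, where ⊕ is min, ⊗ is +, the zero element is ∞ and the one element is the number 0. Assuming the usual definition of k-closedness (the ⊕-sum of the powers of a cycle weight up to k+1 equals the ⊕-sum of its powers up to k), this semiring is closed with k = 0, since for any cycle weight a ≥ 0:

```latex
\overline{1} \oplus a \;=\; \min(0, a) \;=\; 0 \;=\; \overline{1},
\qquad\text{hence}\qquad
\bigoplus_{i=0}^{1} a^{i} \;=\; \bigoplus_{i=0}^{0} a^{i}.
```

In other words, going around a cycle never improves a tropical distance when the weights are non-negative, which is consistent with the Dijkstra-like bound quoted above.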


ε-removal(A)
    for each p ∈ Q do
        E[p] ← {e ∈ E[p] : i[e] ≠ ε}
        for each (q, w) ∈ C[p] do        ▷ C[p] = closure(p)
            E[p] ← E[p] ∪ {(p, a, w ⊗ w', r) : (q, a, w', r) ∈ E[q] and a ≠ ε}
            if q ∈ F then
                F ← F ∪ {p}
                ρ[p] ← ρ[p] ⊕ (w ⊗ ρ[q])


Fig. 4. Pseudocode of the ε-removal algorithm.

A.2 Epsilon removal

Let A be a weighted automaton over K with ε-transitions. Let A_ε be the automaton obtained by deleting all the transitions not labeled by ε from A. A general ε-removal algorithm based on the generic shortest-distance algorithm presented above was given in [19]. This algorithm works if the semiring K is k-closed for A_ε.

The algorithm is divided into two steps. The first step consists of computing the ε-closure of each state p in A. Let d_ε[p, q] denote the ε-distance from p to q, for p, q ∈ Q:

\[ d_\epsilon[p, q] = \bigoplus_{\pi \in P(p, q),\; i[\pi] = \epsilon} w[\pi]. \tag{11} \]

The ε-closure of p is then defined as

\[ \mathrm{closure}(p) = \{ (q, d_\epsilon[p, q]) \mid d_\epsilon[p, q] \neq 0 \}. \tag{12} \]

The ε-closure of p can be computed by using the generic shortest-distance algorithm on A_ε with source p.

The second step consists of, for each state p having at least one incoming non-ε transition, deleting all the outgoing ε-transitions of p, and adding out of p all the non-ε transitions leaving each state q in closure(p), with their weights pre-⊗-multiplied by d_ε[p, q]. The pseudocode of this second step is given in Figure 4.
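The two steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of [19]: the names EPS, eps_closure, and remove_epsilon, as well as the dictionary-based representation, are hypothetical. The sketch lets closure(p) contain p itself with weight one, reads from the original transitions while building new tables, and processes every state rather than only those with an incoming non-ε transition (it does not prune states that become unreachable).

```python
from collections import deque

EPS = None  # hypothetical marker used for the ε label


def eps_closure(trans, p, plus, times, zero, one):
    """ε-distances from p, computed on the ε-transitions only."""
    d, r = {p: one}, {p: one}
    queue = deque([p])
    while queue:
        q = queue.popleft()
        R, r[q] = r[q], zero
        for label, w, nxt in trans.get(q, []):
            if label is not EPS:
                continue
            rw = times(R, w)
            old = d.get(nxt, zero)
            new = plus(old, rw)
            if new != old:
                d[nxt] = new
                r[nxt] = plus(r.get(nxt, zero), rw)
                if nxt not in queue:
                    queue.append(nxt)
    d[p] = one
    return {q: w for q, w in d.items() if w != zero}


def remove_epsilon(trans, final, plus, times, zero, one):
    """trans: state -> list of (label, weight, next); final: state -> weight."""
    states = set(trans) | set(final)
    states |= {n for out in trans.values() for _, _, n in out}
    new_trans, new_final = {}, {}
    for p in states:
        out, rho = [], zero
        for q, w in eps_closure(trans, p, plus, times, zero, one).items():
            # Copy the non-ε transitions of q, pre-multiplied by d_ε[p, q].
            out += [(a, times(w, w2), n)
                    for a, w2, n in trans.get(q, []) if a is not EPS]
            if q in final:
                rho = plus(rho, times(w, final[q]))
        new_trans[p] = out
        if rho != zero:
            new_final[p] = rho
    return new_trans, new_final
```

For instance, in the tropical semiring, a state 0 with an ε-transition of weight 1 to a state 1 inherits each non-ε transition of state 1 with its weight increased by 1, and becomes final whenever state 1 is final.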

A.3 Weight pushing

Weight pushing is an algorithm for normalizing the distribution of the weights along the paths of a weighted automaton [18].

Let A be a weighted automaton over K and assume that K is weakly left divisible and zero-sum-free. For every state q ∈ Q, assume that the shortest distance from q to F,

\[ d_F[q] = \bigoplus_{\pi \in P(q, F)} \big( w[\pi] \otimes \rho[n[\pi]] \big), \tag{13} \]

is well defined in K. The weight pushing algorithm consists of computing each d_F[q] and of reweighting A in the following way:

\[ \begin{aligned}
&\forall e \in E \text{ such that } d_F[p[e]] \neq 0, & w[e] &\leftarrow d_F[p[e]]^{-1} \otimes \big( w[e] \otimes d_F[n[e]] \big) \\
&\forall q \in I, & \lambda[q] &\leftarrow \lambda[q] \otimes d_F[q] \\
&\forall q \in F \text{ such that } d_F[q] \neq 0, & \rho[q] &\leftarrow d_F[q]^{-1} \otimes \rho[q]
\end{aligned} \tag{14} \]

The complexity of the reweighting step is linear in the size of A under the assumption that the cost of the ⊗ operation is constant. The first step can be achieved by applying the shortest-distance algorithm to the reverse of A, hence the complexity of this step is as discussed in Section A.1.

Weight pushing has two interesting properties: (1) it does not change the weight of successful paths, and (2) the resulting weighted automaton is stochastic, i.e., for any state q, the ⊕-sum of the weights of the outgoing transitions of q is equal to 1.
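The reweighting above can be illustrated in the tropical semiring, where ⊗ is + and the inverse of d is −d. The sketch below is an assumption-laden example rather than the algorithm of [18]: the function name and data layout are hypothetical, weights are assumed non-negative, and d_F is computed by a plain relaxation loop over the transitions instead of the generic shortest-distance algorithm applied to the reverse automaton.

```python
import math


def push_weights_tropical(trans, initial, final):
    """trans: list of (src, label, weight, dst); initial, final: state -> weight."""
    states = ({s for s, _, _, _ in trans} | {t for _, _, _, t in trans}
              | set(initial) | set(final))
    # d_F[q]: shortest distance from q to the final states, final weight included.
    dF = {q: final.get(q, math.inf) for q in states}
    changed = True
    while changed:  # terminates when weights are non-negative
        changed = False
        for s, _, w, t in trans:
            if w + dF[t] < dF[s]:
                dF[s] = w + dF[t]
                changed = True

    def reachable(q):
        return dF[q] != math.inf  # d_F[q] differs from the semiring zero

    # w[e] <- d_F[p[e]]^-1 (x) w[e] (x) d_F[n[e]], written with + and -.
    new_trans = [(s, a, (w + dF[t] - dF[s]) if reachable(s) else w, t)
                 for s, a, w, t in trans]
    new_initial = {q: lam + dF[q] for q, lam in initial.items()}
    new_final = {q: (rho - dF[q]) if reachable(q) else rho
                 for q, rho in final.items()}
    return new_trans, new_initial, new_final
```

In the tropical semiring, the stochastic property means that the minimum of the outgoing weights of a non-final state is 0 after pushing, while the total weight of each successful path is unchanged.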

A.4 Weighted minimization

A  weighted  automaton  A  is  deterministic  if  no  two  transitions  leaving  any  state 
share  the  same  label  and  if  it  has  a  unique  initial  state.  A  deterministic  weighted 
automaton  is  minimal  if  there  exists  no  other  deterministic  automaton  having  a 
smaller  number  of  states  and  realizing  the  same  function. 

A general weighted minimization algorithm was presented in [18]. Let A be a weighted automaton over K. The algorithm consists of the execution of the following steps:

1.  weight  pushing, 

2.  (unweighted)  automata  minimization,  considering  each  pair  (label,  weight) 
as  a  single  label. 

Assuming that the conditions of application of weight pushing hold, the resulting weighted automaton, denoted by min(A), is minimal and equivalent to A. The complexity of the second step is in O(|E| log |Q|) using the Hopcroft algorithm [12].
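The second step can be sketched as follows for a deterministic automaton whose weights have already been pushed. This is only a sketch: the names relabel and minimize_pushed and the data layout are hypothetical, and it uses a simple Moore-style partition refinement rather than Hopcroft's algorithm, so it does not achieve the O(|E| log |Q|) bound. Each pair (label, weight) is treated as a single label, as described above.

```python
def relabel(keys):
    """Map each distinct key to a small integer block identifier."""
    ids = {}
    return {q: ids.setdefault(k, len(ids)) for q, k in keys.items()}


def minimize_pushed(trans, final):
    """Partition the states of a pushed deterministic weighted automaton.

    trans: state -> {label: (weight, next_state)}
    final: state -> final weight
    Returns a dict state -> block id; states in the same block can be merged.
    """
    states = set(trans) | set(final)
    states |= {t for out in trans.values() for _, t in out.values()}
    # Initial partition: group states by final weight (None if not final).
    block = relabel({q: final.get(q) for q in states})
    while True:
        # Signature: own block plus the blocks reached by each (label, weight).
        sig = {
            q: (block[q],
                tuple(sorted(((a, w), block[t])
                             for a, (w, t) in trans.get(q, {}).items())))
            for q in states
        }
        refined = relabel(sig)
        if len(set(refined.values())) == len(set(block.values())):
            return refined
        block = refined
```

Two states end up in the same block exactly when they have the same final weight and, for every (label, weight) pair, lead to states of the same block; merging each block into one state gives the minimized automaton.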


